fix: add short-name package for E1 judge env by intertwine · Pull Request #71 · intertwine/security-verifiers

intertwine · 2026-03-06T17:07:21Z

Summary

add a dedicated short-name judge environment package at environments/sv-env-netlogs-judge
repoint configs/rl/e1_judge.toml to intertwine/sv-netlogs-judge
document the Prime 20-char label truncation bug that breaks sv-env-network-logs-judge

Why

Prime truncates long environment names when deriving Kubernetes labels. The original judge env name truncates to sv-env-network-logs-, which is invalid because it ends with -. A simple Hub rename was not enough because Prime also expects the installed distribution metadata to match the short name.

This PR publishes the judge variant as a real short-name package (sv-netlogs-judge) so hosted RL runs can both install and launch cleanly.
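The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not Prime's actual code: it assumes the truncation is a plain 20-character prefix cut (the PR only states that names are truncated to 20 characters), and the helper names `derive_label` and `is_valid_label_value` are hypothetical. The validation rule is the standard Kubernetes label-value constraint: at most 63 characters, starting and ending with an alphanumeric character.

```python
import re

# Kubernetes label values: empty, or alphanumeric at both ends with
# '-', '_', '.' allowed in between; at most 63 characters.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")

# Limit taken from the PR description; the prefix-cut behavior is an assumption.
PRIME_LABEL_LIMIT = 20


def derive_label(env_name: str, limit: int = PRIME_LABEL_LIMIT) -> str:
    """Hypothetical stand-in for Prime's label derivation: keep the first `limit` chars."""
    return env_name[:limit]


def is_valid_label_value(value: str) -> bool:
    """Check a string against the Kubernetes label-value syntax."""
    return len(value) <= 63 and LABEL_VALUE_RE.fullmatch(value) is not None


if __name__ == "__main__":
    for name in ("sv-env-network-logs-judge", "sv-netlogs-judge"):
        label = derive_label(name)
        print(f"{name!r} -> {label!r} valid={is_valid_label_value(label)}")
```

Under these assumptions the long name is cut to a label that ends in `-` and fails validation, while the 16-character short name passes through unchanged.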

Test Plan

pytest environments/sv-env-netlogs-judge/sv_netlogs_judge_test.py -q
pushed intertwine/sv-netlogs-judge v0.2.17 and verified Prime integration action succeeds
launched hosted judge run on Prime using OpenRouter-backed OpenAI-compatible env vars plus HF dataset access
confirmed live run: i5fvjomdr3yjn3pg6y7x3nu8

Notes

active W&B run: intertwine/svbench-e1/runs/bhsxeoiv
config now uses intertwine-ai/security-verifiers-e1 via E1_HF_REPO for hosted dataset loading

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a22d596920


chatgpt-codex-connector · 2026-03-06T17:12:29Z


 # Optional Hub environment name override (defaults to pyproject.toml name)
-# Usage: make hub-deploy E=network-logs NAME=sv-env-network-logs-judge
+# Usage: make hub-deploy E=network-logs NAME=sv-netlogs-judge


Use the short-package path in hub-deploy usage example

The new usage comment points to make hub-deploy E=network-logs NAME=sv-netlogs-judge, but the deploy recipe builds from environments/sv-env-$(E), so that command still packages the old sv-env-network-logs distribution instead of sv-env-netlogs-judge. Following this instruction can reintroduce the package-metadata mismatch this commit is trying to avoid and lead to failed installs/runs for the short ID.


chatgpt-codex-connector · 2026-03-06T17:12:29Z

+
+    if logger and logger.enabled:
+        logger.log_environment_init(
+            environment_name="sv-env-network-logs-judge",


Align logged environment name with the short alias

The short-name entrypoint wraps this loader, but log_environment_init still emits environment_name="sv-env-network-logs-judge" here before the wrapper renames the returned env object. In runs where rollout logging is enabled, sv-netlogs-judge executions will be recorded under the old ID, which can mix telemetry across env IDs and make experiment attribution unreliable.


intertwine · 2026-03-07T18:12:45Z

@claude review

claude · 2026-03-07T18:13:01Z

Claude finished @intertwine's task in 2m 37s

Code Review

Gather context and read changed files
Review new package structure (sv-env-netlogs-judge)
Review entrypoint and loader logic
Review test coverage
Review Makefile and config changes
Summarize findings

Summary

The approach is sound: publishing a short-name package to work around Prime's 20-character Kubernetes label truncation bug is the right move, and the wrapper pattern in sv_netlogs_judge.py is clean. Two issues need fixes before merge, and a few minor points are worth addressing.

Issues

🔴 P1 — Makefile comment still points to the wrong source env

Makefile:39

# Usage: make hub-deploy E=network-logs NAME=sv-netlogs-judge

The deploy target does cd environments/sv-env-$(E), so E=network-logs builds and pushes from environments/sv-env-network-logs (the base classifier), not the new judge package. The NAME override only changes the Hub label sent to prime env push --name, not the package source. This is the exact package-metadata mismatch the PR is trying to prevent.

The correct invocation is:

# Usage: make hub-deploy E=netlogs-judge

No NAME override is needed because pyproject.toml already declares name = "sv-netlogs-judge".
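The mismatch can be made concrete by modeling how the recipe picks its inputs. The following is a rough sketch in Python rather than the Makefile itself: `resolve_hub_deploy` is a hypothetical helper, and the pyproject name used for the base classifier in the usage note is an assumption. It only encodes the two facts stated above: the source directory comes from E, and NAME (when given) replaces the Hub name without changing what gets built.

```python
from typing import Optional, Tuple


def resolve_hub_deploy(
    e: str, pyproject_name: str, name: Optional[str] = None
) -> Tuple[str, str]:
    """Model of the hub-deploy inputs (hypothetical helper, not the real recipe).

    e              -- value passed as E=...; selects environments/sv-env-<e>
    pyproject_name -- the `name` declared in that directory's pyproject.toml
    name           -- optional NAME=... override for the Hub name only
    """
    source_dir = f"environments/sv-env-{e}"
    hub_name = name or pyproject_name
    return source_dir, hub_name


def has_metadata_mismatch(pyproject_name: str, hub_name: str) -> bool:
    """True when the pushed Hub name differs from the packaged distribution name."""
    return pyproject_name != hub_name
```

With `E=network-logs NAME=sv-netlogs-judge`, the model returns the base classifier directory paired with the short Hub name, which is the mismatch; with `E=netlogs-judge` and no override, the directory and the name agree.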

🟡 P2 — Telemetry is logged under the old ID before the wrapper renames it

sv_env_network_logs_judge.py:267-276

logger.log_environment_init(
    environment_name="sv-env-network-logs-judge",  # ← logged here
    ...
)

sv_netlogs_judge.py:19-21

env = _load_environment(**kwargs)
env.name = SHORT_ENV_ID       # ← renamed after the fact
env.env_id = SHORT_ENV_ID

When a logger is passed to the wrapper, log_environment_init fires inside _load_environment with "sv-env-network-logs-judge". By the time the wrapper patches env.name, the telemetry event has already been emitted. In a WandB/Weave run launched via sv-netlogs-judge, the init event will appear under the old ID, mixing telemetry across env IDs.

The simplest fix is to accept the env name as an optional parameter in load_environment:

# sv_env_network_logs_judge.py
def load_environment(
    ...
    env_name: str = "sv-env-network-logs-judge",
    ...
):
    ...
    logger.log_environment_init(environment_name=env_name, ...)
    return vf.SingleTurnEnv(name=env_name, ...)

# sv_netlogs_judge.py
def load_environment(**kwargs):
    kwargs.setdefault("env_name", SHORT_ENV_ID)
    return _load_environment(**kwargs)

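To see why the ordering matters, here is a self-contained sketch that replaces the real logger and `vf.SingleTurnEnv` with stubs. `RecordingLogger`, `StubEnv`, `load_impl`, and the two wrapper variants are hypothetical names for illustration; only the `env_name` parameter, the `setdefault` call, and the two ID strings come from the review above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SHORT_ENV_ID = "sv-netlogs-judge"
DEFAULT_ENV_NAME = "sv-env-network-logs-judge"


@dataclass
class RecordingLogger:
    """Stub logger that records the environment name of each init event."""
    enabled: bool = True
    events: List[str] = field(default_factory=list)

    def log_environment_init(self, environment_name: str, **details) -> None:
        self.events.append(environment_name)


@dataclass
class StubEnv:
    """Stub for the environment object; only the name matters here."""
    name: str


def load_impl(
    logger: Optional[RecordingLogger] = None, env_name: str = DEFAULT_ENV_NAME
) -> StubEnv:
    # The init event fires inside the loader, before the caller sees the env.
    if logger and logger.enabled:
        logger.log_environment_init(environment_name=env_name)
    return StubEnv(name=env_name)


def load_short_patch_after(logger: Optional[RecordingLogger] = None) -> StubEnv:
    """Original pattern: rename after loading, so telemetry keeps the old ID."""
    env = load_impl(logger=logger)
    env.name = SHORT_ENV_ID
    return env


def load_short_threaded(**kwargs) -> StubEnv:
    """Suggested pattern: pass the short ID in, so the event uses it from the start."""
    kwargs.setdefault("env_name", SHORT_ENV_ID)
    return load_impl(**kwargs)
```

In this model both wrappers return an env named `sv-netlogs-judge`, but only the threaded variant records the init event under that name. Using `setdefault` also leaves room for a caller to pass a different `env_name` explicitly.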

Minor Points

Test file included in the wheel (pyproject.toml:43)
sv_netlogs_judge_test.py is listed under [tool.hatch.build] include. Tests shouldn't ship in production wheels: they inflate the artifact and can conflict with the Hub install. Compare with how sv-env-network-logs handles this.

PR body test-plan path typo
The test plan says pytest environments/sv-env-netlogs-judge/sv_env_network_logs_judge_test.py but the actual file is sv_netlogs_judge_test.py. Not a code issue, but worth correcting in the description.

env.env_id is non-standard (sv_netlogs_judge.py:21)
env.env_id = SHORT_ENV_ID dynamically adds an attribute that doesn't exist on vf.SingleTurnEnv. If the Verifiers framework doesn't read env_id anywhere, this is a no-op. Worth double-checking whether Prime reads env_id vs name to pick the entrypoint key — if name is sufficient, the env_id assignment can be dropped.

Duplicate sys.path manipulation (sv_env_network_logs_judge.py:20-53)
Two separate try/except import blocks both call sys.path.append in their fallback paths. If the weave_init import succeeds but sv_shared fails (unlikely in practice), the path gets set correctly. No bug, but this could be simplified by computing the path once at module level.

What's Working Well

The wrapper pattern in sv_netlogs_judge.py is minimal and correct — delegates everything to the main loader and patches env.name.
judge_sampling_args={"max_tokens": 16, "temperature": 0.0} is the right configuration for a binary yes/no signal.
Test coverage is solid: parser, prompt placeholders, all four judge response cases (yes, yes with trailing text, no, maybe), environment loading, and the short-alias test.
e1_judge.toml is clearly matched to e1.toml for controlled WP3c comparison — the comment block documenting the reward differences is very useful.
The research doc update correctly points Condition B to the new Hub name.


intertwine · 2026-03-07T19:01:38Z

@claude re-review

claude · 2026-03-07T19:01:49Z

Claude finished @intertwine's task in 1m 46s

Re-review

Gather context
Read changed files
Check if previous issues were addressed
Post updated review

Summary

All P1 and P2 issues from the previous review have been fixed. The two remaining items are minor stale references in the research doc. This PR is ready to merge.

Previous Issues — All Resolved ✅

P1 — Makefile comment (Makefile:39)

# Usage: make hub-deploy E=netlogs-judge

Fixed. The comment now correctly points to E=netlogs-judge, which maps to environments/sv-env-netlogs-judge. No NAME override needed.

P2 — Telemetry logged under old ID
Fixed. sv_netlogs_judge_impl.py now accepts env_name: str = DEFAULT_ENV_NAME and passes it to both log_environment_init and vf.SingleTurnEnv(name=env_name, env_id=env_name, ...). The wrapper sets kwargs.setdefault("env_name", SHORT_ENV_ID) before delegating, so telemetry emits under "sv-netlogs-judge" from the start.

Test file in wheel
Fixed. pyproject.toml [tool.hatch.build] now includes only sv_netlogs_judge.py and sv_netlogs_judge_impl.py — the test file is correctly excluded.
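One way to guard against this regressing is to inspect the built wheel, which is an ordinary zip archive. The sketch below is an illustration, not part of the PR: `find_test_modules` and `build_fake_wheel` are hypothetical helpers, and the `_test.py` / `test_` naming convention is an assumption based on the file name mentioned above. It builds a tiny in-memory archive so the example runs without a real build.

```python
import io
import zipfile
from typing import Iterable, List


def find_test_modules(wheel_bytes: bytes) -> List[str]:
    """Return archive members that look like test modules (by file name only)."""
    with zipfile.ZipFile(io.BytesIO(wheel_bytes)) as wheel:
        hits = []
        for member in wheel.namelist():
            basename = member.rsplit("/", 1)[-1]
            if basename.endswith("_test.py") or basename.startswith("test_"):
                hits.append(member)
        return sorted(hits)


def build_fake_wheel(member_names: Iterable[str]) -> bytes:
    """Create an in-memory zip with empty files, standing in for a built wheel."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for member in member_names:
            archive.writestr(member, "")
    return buffer.getvalue()
```

Against a real artifact, the same check could be run on the bytes of the built `.whl` file and asserted to return an empty list.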

Test coverage for the fix
TestLoadEnvironment.test_short_alias_loads_environment (line 172) now explicitly verifies that env.name, env.env_id, and log_environment_init(environment_name=...) all use "sv-netlogs-judge" when called via the short-name wrapper. This is exactly the regression test the previous review was asking for.

Remaining Minor Items

Stale status checklist in research doc (reward_source_comparison.md:104–119)

Two items are out of date:

- [ ] Deploy judge environment to Hub: the PR body confirms v0.2.17 is live at intertwine/sv-netlogs-judge, so this should be checked off.
- The Files table still lists environments/sv-env-network-logs/sv_env_network_logs_judge.py as the judge variant file, but the implementation has moved to environments/sv-env-netlogs-judge/sv_netlogs_judge_impl.py.

Neither blocks merge, but both are worth tidying up.

Pre-existing: REPO_ROOT path resolution (sv_netlogs_judge_impl.py:26)

REPO_ROOT = str(Path(__file__).resolve().parents[2])

When installed as a wheel, __file__ is inside site-packages and parents[2] won't resolve to the repo root. This is the same pattern used in the other environments, so it's a pre-existing limitation rather than a regression. The installed package path takes precedence in the try block anyway, so it only matters if sv_shared isn't co-installed — which won't happen in a Prime environment. Not a blocker.
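The difference between the two layouts is easy to check with pure path arithmetic. This sketch uses `PurePosixPath` instead of `Path(__file__).resolve()` so it never touches the filesystem, and both example paths are hypothetical: a checkout under `/work/security-verifiers` and a virtualenv under `/venv`.

```python
from pathlib import PurePosixPath


def repo_root_guess(module_file: str) -> str:
    """Mirror the parents[2] lookup on a given module path (no filesystem access)."""
    return str(PurePosixPath(module_file).parents[2])


# Hypothetical locations of the same module in a source checkout and in an install.
CHECKOUT_FILE = (
    "/work/security-verifiers/environments/"
    "sv-env-netlogs-judge/sv_netlogs_judge_impl.py"
)
INSTALLED_FILE = "/venv/lib/python3.12/site-packages/sv_netlogs_judge_impl.py"

if __name__ == "__main__":
    print(repo_root_guess(CHECKOUT_FILE))
    print(repo_root_guess(INSTALLED_FILE))
```

For the checkout path, two levels up from the module's directory is the repository root; for the installed path, the same lookup lands inside the virtualenv's `lib` directory, which is why the fallback only helps when running from a source tree.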

What's Working Well

The implementation is clean and the test suite is comprehensive. In particular, test_short_alias_loads_environment directly validates the core correctness property — that the wrapper correctly threads env_name through so that env.name, env.env_id, and telemetry are all consistent with "sv-netlogs-judge". The judge_sampling_args={"max_tokens": 16, "temperature": 0.0} configuration is appropriate for the binary yes/no signal.

intertwine added 2 commits March 6, 2026 17:06

fix: add short-name package for E1 judge env

a22d596

fix: rename short-package judge test module

45f9816

chatgpt-codex-connector Bot reviewed Mar 6, 2026


fix: address PR #71 review feedback

b2f44c7

- wire env_name through the short judge loader for consistent telemetry
- rename the short-package implementation module to avoid collisions
- fix hub-deploy docs and exclude tests from the wheel

docs: refresh PR #71 research notes

e2220ae

intertwine merged commit 3e66fd0 into main Mar 7, 2026
5 checks passed

intertwine deleted the fix/e1-judge-short-package branch March 7, 2026 19:17
